PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
Ma, Liang, Wen, Jiajun, Lin, Min, Xu, Rongtao, Liang, Xiwen, Lin, Bingqian, Ma, Jun, Wang, Yongxin, Wei, Ziming, Lin, Haokun, Han, Mingfei, Cao, Meng, Chen, Bokui, Laptev, Ivan, Liang, Xiaodan
While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive-hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that VLMs exhibit pronounced limitations in high-level planning and reasoning, with performance declining markedly as task complexity grows. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvement, suggesting that spatial tasks rely heavily on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.
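The partial-completion dimension described above can be illustrated with a minimal scorer. The step encoding `(block_id, position)` and the exact-match rule are illustrative assumptions for this sketch, not PhyBlock's actual metric.

```python
# Hedged sketch of a partial-completion score: the fraction of target
# block placements that also appear in the model's executed plan.
# The (block_id, grid_position) tuple encoding is an assumption.

def partial_completion(executed, target):
    """Return the fraction of target placements matched by execution."""
    if not target:
        return 1.0
    matched = sum(1 for placement in target if placement in executed)
    return matched / len(target)

# Hypothetical three-block target; the model placed two blocks correctly.
target_plan = [("red_cube", (0, 0)), ("blue_arch", (0, 1)), ("roof", (0, 2))]
executed_plan = [("red_cube", (0, 0)), ("roof", (0, 2))]
score = partial_completion(executed_plan, target_plan)
```

A metric like this rewards partially correct assemblies instead of scoring them as outright failures, which is what distinguishes partial completion from a binary success rate.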
VLM-driven Skill Selection for Robotic Assembly Tasks
Kim, Jeong-Jung, Koh, Doo-Yeol, Kim, Chang-Hyun
Robotic assembly tasks represent one of the most challenging problems in robotics, requiring precise manipulation capabilities combined with sophisticated reasoning about complex multi-step processes. Unlike simple pick-and-place tasks, assembly tasks demand long-term planning that spans multiple sequential actions, where each step must be carefully coordinated with previous and subsequent operations. Furthermore, these tasks require physical understanding of component interactions and spatial relationships between parts [1], [2], [3]. Vision-Language Models (VLMs) have emerged as powerful tools that bridge visual perception and high-level reasoning, offering significant advantages for robotic applications. These models excel at processing visual information while understanding natural language instructions, making them well-suited for complex manipulation tasks.
Rectified Point Flow: Generic Point Cloud Pose Estimation
Sun, Tao, Zhu, Liyuan, Huang, Shengyu, Song, Shuran, Armeni, Iro
We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: https://rectified-pointflow.github.io/.
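The velocity-field transport above can be sketched in a few lines. Here the ideal straight-line field, which rectified-flow training regresses toward, stands in for the learned network; this is an illustrative simplification, not the paper's implementation.

```python
# Hedged sketch: transport noisy source points toward target positions
# by forward-Euler integration of a velocity field. The "learned" field
# is replaced by the straight-line velocity v = x1 - x0 that rectified
# flow regresses toward during training.

def euler_transport(x0, x1, steps=20):
    """Integrate each point from x0 toward x1 with forward Euler."""
    x = [list(p) for p in x0]
    dt = 1.0 / steps
    for _ in range(steps):
        for i, (src, tgt) in enumerate(zip(x0, x1)):
            v = [t - s for s, t in zip(src, tgt)]  # straight-line field
            x[i] = [c + dt * vc for c, vc in zip(x[i], v)]
    return x

noisy = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]      # unposed input points
target = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]     # assembled positions
moved = euler_transport(noisy, target)
```

With the ideal field the integration lands exactly on the targets; in the actual method the field is a network conditioned on the unposed point clouds, and part poses are recovered from the transported points.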
VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning
Huang, Binghao, Xu, Jie, Akinola, Iretiayo, Yang, Wei, Sundaralingam, Balakumar, O'Flaherty, Rowland, Fox, Dieter, Wang, Xiaolong, Mousavian, Arsalan, Chao, Yu-Wei, Li, Yunzhu
Humans excel at bimanual assembly tasks by adapting to rich tactile feedback -- a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at https://binghao-huang.github.io/vt_refine/.
Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models
Tie, Chenrui, Sun, Shengxiang, Lin, Yudi, Wang, Yanbo, Li, Zhongrui, Zhong, Zhouhan, Zhu, Jinxuan, Pang, Yiman, Chen, Haonan, Chen, Junting, Wu, Ruihai, Shao, Lin
Assembly hinges on reliably forming connections between parts, yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the critical "last mile" of assembly execution: while task planning may sequence operations and motion planning may position parts, the precise establishment of physical connections ultimately determines assembly success or failure. In this paper, we consider connections as first-class primitives in assembly representation, including connector types, specifications, quantities, and placement locations. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence.
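The connector-aware graph described above might be encoded as follows. The part names, connector fields, and helper function are illustrative assumptions, not Manual2Skill++'s actual schema.

```python
# Hedged sketch: an assembly graph whose nodes are parts/sub-assemblies
# and whose edges carry explicit connector information (type, spec,
# quantity), treating connections as first-class primitives.

assembly_graph = {
    "nodes": ["leg_front", "leg_back", "tabletop"],
    "edges": [
        {"parts": ("leg_front", "tabletop"),
         "connector": {"type": "cam_lock", "spec": "M6", "quantity": 2}},
        {"parts": ("leg_back", "tabletop"),
         "connector": {"type": "wood_dowel", "spec": "8mm", "quantity": 4}},
    ],
}

def bill_of_connectors(graph):
    """Aggregate connector quantities by (type, spec) across all edges."""
    bill = {}
    for edge in graph["edges"]:
        c = edge["connector"]
        key = (c["type"], c["spec"])
        bill[key] = bill.get(key, 0) + c["quantity"]
    return bill

bom = bill_of_connectors(assembly_graph)
```

Making connectors explicit edge attributes, rather than an afterthought of pose planning, is what lets the downstream pipeline verify that every required connection is actually established.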
Refinery: Active Fine-tuning and Deployment-time Optimization for Contact-Rich Policies
Tang, Bingjie, Akinola, Iretiayo, Xu, Jie, Wen, Bowen, Fox, Dieter, Sukhatme, Gaurav S., Ramos, Fabio, Gupta, Abhishek, Narang, Yashraj
Simulation-based learning has enabled policies for precise, contact-rich tasks (e.g., robotic assembly) to reach high success rates (~80%) under high levels of observation noise and control error. Although such performance may be sufficient for research applications, it falls short of industry standards and makes policy chaining exceptionally brittle. A key limitation is the high variance in individual policy performance across diverse initial conditions. We introduce Refinery, an effective framework that bridges this performance gap, robustifying policy performance across initial conditions. We propose Bayesian Optimization-guided fine-tuning to improve individual policies, and Gaussian Mixture Model-based sampling during deployment to select initializations that maximize execution success. Using Refinery, we improve mean success rates by 10.98% over state-of-the-art methods in simulation-based learning for robotic assembly, reaching 91.51% in simulation and comparable performance in the real world. Furthermore, we demonstrate that these fine-tuned policies can be chained to accomplish long-horizon, multi-part assembly, successfully assembling up to 8 parts without requiring explicit multi-step training.
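The deployment-time sampling step can be sketched with a toy one-dimensional mixture. The component weights, means, and their interpretation as success estimates below are illustrative assumptions, not values from the paper.

```python
import random

# Hedged sketch in the spirit of Refinery's deployment-time GMM sampling:
# draw initial pose offsets from a Gaussian mixture whose component
# weights reflect estimated execution success, so high-success regions
# of the initialization space are sampled more often.

def sample_initialization(components, rng):
    """Draw one offset from a 1-D Gaussian mixture.

    components: list of (weight, mean, std) tuples; weights sum to 1.
    """
    r = rng.random()
    for weight, mu, sigma in components:
        r -= weight
        if r <= 0:
            return rng.gauss(mu, sigma)
    return rng.gauss(components[-1][1], components[-1][2])

rng = random.Random(0)
# (estimated success weight, mean offset in meters, std) -- toy values
components = [(0.8, 0.0, 0.01), (0.2, 0.05, 0.02)]
samples = [sample_initialization(components, rng) for _ in range(1000)]
```

Because the first component carries most of the weight and sits near the nominal (high-success) initialization, the bulk of sampled starts cluster there, which is the mechanism by which deployment-time sampling raises mean execution success.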